A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
A Client has requested this analysis and this is your one shot of what you would say to your boss in a 2 min elevator ride before he takes your report and hands it to the client.
QUESTION|TASK 1
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”).In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.
So I went through the dataset and made sure every instance (row) had all properties (column) and that if the value isn’t present it was set to pandas.NA. After that, I checked to see if any of the properties that ought to be ints are failing integer casting, and attempted to fix them by casting that value as a string, removing all non-numeric characters from the value and casting it back to an int. I then handle all instances of empty strings by setting them to pandas.NA and then cast every column to its proper type. I also removed all instances(rows) that didn’t have a clear month, assuming that rows that don’t give a discernable month are useless and a potential skew of the data. Then i fixed a misspelling of february i found. If this was a real project, I’d create a supplamentary table to map airport codes to airport names, but that wasn’t part of the assignment
Show the code
from p3_source import q1_example, dffrom pandas import NA, isnaq1_example.head().style.format(na_rep="NaN")
airport_code
airport_name
month
year
num_of_flights_total
num_of_delays_carrier
num_of_delays_late_aircraft
num_of_delays_nas
num_of_delays_security
num_of_delays_weather
num_of_delays_total
minutes_delayed_carrier
minutes_delayed_late_aircraft
minutes_delayed_nas
minutes_delayed_security
minutes_delayed_weather
minutes_delayed_total
2
IAD
NaN
January
2005.000000
12381
414
1058
895
4
61
2430
NaN
70919
35660.000000
208
4497
134881
9
IAD
Washington, DC: Washington Dulles International
February
2005.000000
10042
284
631
691
4
28
1639
15573.000000
39840
NaN
169
1359
78878
12
SFO
San Francisco, CA: San Francisco International
February
2005.000000
9327
599
457
1010
0
57
2122
30760.000000
27302
NaN
6
3178
110995
13
SLC
NaN
February
2005.000000
12404
645
463
752
10
79
1947
32336.000000
23087
24544.000000
293
4614
84874
14
ATL
Atlanta, GA: Hartsfield-Jackson Atlanta International
March
2005.000000
37806
1462
999
5697
11
423
9431
NaN
150766
396191.000000
568
33379
691887
QUESTION|TASK 2
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
pandas dataframe styler is giving me trouble, but here’s the data that indicates that SFO airport leads in both average time of delay per flight through all months and percent of flights that have a delay
Show the code
# Include and execute your code herefrom p3_source import q2_datadef hours_format(x):if pd.isna(x):return"" h =int(x) m =int(round((x - h) *60))returnf"{h}h {m}m"if m elsef"{h}h"def percent_format(x):returnf"{x*100:.1f}%"if pd.notna(x) else""q2_data.style.format( {"delay_percent": percent_format, "avg_delay_per_flight_hours": hours_format})q2_data
avg_delay_per_flight
total_flights
total_delays
delay_percent
avg_delay_per_flight_hours
airport_code
ATL
11.971153
4295863
868546
0.202182
0.199519
DEN
9.939602
2455899
455603
0.185514
0.165660
IAD
11.985960
831904
164936
0.198263
0.199766
ORD
15.514155
3516798
815073
0.231766
0.258569
SAN
8.978680
877340
167483
0.190899
0.149645
SFO
16.175981
1591177
415090
0.260870
0.269600
SLC
7.176067
1347833
196873
0.146066
0.119601
QUESTION|TASK 3
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
I’ve already excluded rows that didn’t provide month data, so all records represent an individual month/airport combo. Since the question specified delays of any length, delay time will not be considered. based on sums of delayed flights from all airports for each month combined with sums of total flights, it seems clear that september at 16.5% missed flights and december at 16.7% missed flights.
Show the code
from p3_source import q3_data, q3_chartq3_data
total_flights
total_delays
delay_percent
month_no
month
4
1193018
265001
0.222127
1
January
3
1115814
248033
0.222289
2
February
7
1213370
250142
0.206155
3
March
0
1259723
231408
0.183698
4
April
8
1227795
233494
0.190173
5
May
6
1305663
317895
0.243474
6
June
5
1371741
319960
0.233251
7
July
1
1335158
279699
0.209488
8
August
11
1227208
201905
0.164524
9
September
10
1301612
235166
0.180673
10
October
9
1185434
197768
0.166832
11
November
2
1180278
303133
0.256832
12
December
Show the code
q3_chart
QUESTION|TASK 4
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
b. 30% of all delayed flights in the Late-Arriving category are due to weather
c. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
I created a map (average_weather_rates) based on the rules above that gives the average percent of late_aircraft or nas delays by month, and then created a dataframe that puts the late_aircraft multipled by the percent for the row’s month in a late aircraft weather column, and the same for nas delays. I also added columns that sums late aircraft weather delays and nas weather delays, and one that sums that sum and the actual weather delays to get total mild and severe weather delays.
The values for the mild weather delays are by nesessity not integers, ’cause they’re based on extrapolation from the monthly weather delay proportions set down in the rules. Assuming the rates in the rules represent averages based on representative data, then my data will faithfully extrapolate approximate values based on those average rates. I’m leaving them as floating-point vals instead of rounding to ints so as to no obfuscate information.
Show the code
from p3_source import q4_dataq4_data
airport_code
month
year
total_flights
total_delays
late_delays
late_weather_ext
nas_delays
nas_weather_ext
weather_mild_total
weather_severe
weather_all
0
ATL
January
2005.0
35048
8355
999
299.7
4598
1839.2
2138.9
448
2586.9
1
DEN
January
2005.0
12687
3153
928
278.4
935
374.0
652.4
233
885.4
2
IAD
January
2005.0
12381
2430
1058
317.4
895
358.0
675.4
61
736.4
3
ORD
January
2005.0
28194
9178
2255
676.5
5415
2166.0
2842.5
306
3148.5
4
SAN
January
2005.0
7283
1952
680
204.0
638
255.2
459.2
56
515.2
...
...
...
...
...
...
...
...
...
...
...
...
...
892
DEN
December
2015.0
18001
3610
1463
438.9
981
392.4
831.3
152
983.3
893
IAD
December
2015.0
2799
443
183
54.9
61
24.4
79.3
17
96.3
894
ORD
December
2015.0
25568
4233
1755
526.5
1364
545.6
1072.1
180
1252.1
895
SFO
December
2015.0
13833
4465
1180
354.0
2372
948.8
1302.8
147
1449.8
896
SLC
December
2015.0
8804
1745
796
238.8
404
161.6
400.4
56
456.4
897 rows × 12 columns
QUESTION|TASK 5
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
I agregated the weather breakdown data from q4 by airport, and then melted it so that there was a row for each airport/weather category combo (ATL Mild, ALT Severe, DEN Mild, DEN Severe, ect). then I stacked ’em so you could see them as sums. this view allowed me to see both the relative rates for each category of weather delay, but also the proportion of mild to severe weather delays. it looks like there were about 1 severe weather delays for every 10 mild weather delays. The two highest weather delays, san francisco and chicago make sense to me; the fog in SF is famous and chicago’s got all that cold wind off lake michigan
Show the code
from p3_source import q5_data_melted, q5_chartq5_data_melted
airport_code
delay_type
delay_rate
0
ATL
weather_mild_delay_rate
0.060603
7
ATL
weather_severe_delay_rate
0.007338
1
DEN
weather_mild_delay_rate
0.051904
8
DEN
weather_severe_delay_rate
0.005513
2
IAD
weather_mild_delay_rate
0.054624
9
IAD
weather_severe_delay_rate
0.005645
3
ORD
weather_mild_delay_rate
0.075767
10
ORD
weather_severe_delay_rate
0.005789
4
SAN
weather_mild_delay_rate
0.046532
11
SAN
weather_severe_delay_rate
0.004734
5
SFO
weather_mild_delay_rate
0.085776
12
SFO
weather_severe_delay_rate
0.006307
6
SLC
weather_mild_delay_rate
0.035481
13
SLC
weather_severe_delay_rate
0.004959
Show the code
q5_chart
STRETCH QUESTION|TASK 1
Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.